Support Vector Machines (SVMs) are a type of supervised learning algorithm that can be used for classification or regression tasks. The main idea behind SVMs is to find a hyperplane that maximally separates the different classes in the training data. This is done by finding the hyperplane with the largest margin, defined as the distance between the hyperplane and the closest data points from each class. Once the hyperplane is determined, new data can be classified by determining on which side of the hyperplane it falls. SVMs are particularly useful when the data has many features and/or when there is a clear margin of separation in the data.
Fig: Linearly and non-linearly separable data
Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
Fig: Hyperplane in 2-D and 3-D feature space
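To make this concrete, here is a minimal sketch (not part of the original analysis; the toy points and labels are invented) that fits a linear SVM on 2-D data with scikit-learn, where the learned hyperplane \(w_1 x_1 + w_2 x_2 + b = 0\) is simply a line:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two small, linearly separable clusters (illustrative only)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear')
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                   # normal vector and offset of the separating line
print(clf.support_vectors_)   # the closest points, which fix the line's position
```

With more than three features the same object is a higher-dimensional hyperplane; only the printed coefficient vector grows longer.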
Support Vector Machines (SVM) have emerged as a potent tool in the realm of supervised learning, offering a robust mathematical framework for both classification and regression tasks. With a foundation rooted in principles such as structural risk minimization and kernel functions (Jakkula, 2006), SVM has demonstrated exceptional generalization capabilities, adeptly handling non-linear decision boundaries through kernel tricks (Jun, 2021; Deris, 2011). Despite challenges like computational cost and scalability (Bhavsar, 2012), the evolution of SVM has led to significant contributions in diverse fields, including pattern recognition, computer vision (Kecman, 2005), and even agriculture, where it aids in optimizing crop yield and disease identification (Kumar et al., 2017). The ongoing advancements in SVM research are geared towards refining algorithms and broadening their application spectrum, especially in the context of burgeoning data volumes (Yue, 2003).
In the financial and healthcare sectors, SVM has proven its efficacy in various applications. It has been utilized to construct reliable stock market prediction models by analyzing financial indices like Earnings Per Share (EPS) and Net Profit Growth Rate (NPGR) (Han, 2007). In healthcare, SVM has been instrumental in developing advanced diagnostic tools, such as the optimized SVM model for early dementia prediction (Javeed et al., 2023) and the multi-disease prediction model using an improved SVM-radial bias kernel approach (Harimoorthy & Thangavelu, 2021). These innovations underscore the potential of machine learning in revolutionizing healthcare by facilitating early diagnosis and personalized treatment plans.
SVM’s application extends to domains like online retail and network security, where it addresses complex challenges with remarkable efficiency. In online marketplaces, SVM combined with Particle Swarm Optimization has enhanced the accuracy of text classification for customer reviews (Sahara et al., 2023), providing valuable insights for sellers. In the realm of network security, innovative approaches such as combining SVM with naïve Bayes feature embedding have been proposed for intrusion detection, achieving high accuracy rates in identifying network threats (Gu et al., 2021). Moreover, the development of hybrid methods for attack detection, which integrate SVM features with evolutionary algorithms and artificial neural networks, has shown significant promise in reducing dimensionality and training time while maintaining high detection accuracy (Hosseini et al., 2020).
Machine learning techniques, particularly SVM, are revolutionizing various fields by addressing complex challenges with precision and efficiency. In healthcare, SVM has been applied to electronic health records for cancer classification, achieving high accuracy rates in identifying different types of malignancies (Ghanem et al., 2021). Furthermore, SVM’s versatility is evident in its application across domains such as finance, where it has been used to assess credit risk for small and medium enterprises in supply chain finance (Zhang, Hu, & Zhang, 2015), and in cloud-based services, where it ensures data confidentiality and decision verifiability in health monitoring systems (Liang et al., 2021). These advancements highlight the transformative potential of machine learning techniques in enhancing diagnostic accuracy, optimizing financial assessments, and ensuring secure cloud-based services.
Customer retention is a critical aspect for banks to ensure the sustainability of their operations. ABC Multinational Bank, in particular, places a strong emphasis on retaining its account holders. The primary objective of this analysis is to examine the customer data of the bank’s account holders to predict and prevent customer churn effectively.
The dataset under consideration contains information about account holders at ABC Multinational Bank, with the ultimate goal of predicting customer churn. The dataset comprises the following columns:
| Column Name | Description |
|---|---|
| customer_id | A unique identifier for each customer, not used in the analysis. |
| credit_score | A numerical representation of the customer's creditworthiness. |
| country | The country in which the customer resides. |
| gender | The gender of the customer (e.g., male, female). |
| age | The age of the customer in years. |
| tenure | The number of years the customer has been with the bank. |
| balance | The current balance in the customer's account. |
| products_number | The number of products the customer has with the bank. |
| credit_card | Indicates whether the customer has a credit card with the bank. |
| active_member | Indicates whether the customer is an active member. |
| estimated_salary | The estimated annual salary of the customer. |
| churn | The target variable, indicating customer churn (1 for churned, 0 for not churned). |

Source: Bank Churn Dataset
Mathematical Intuition of Support Vector Machine
Consider a binary classification task where there are two classes, denoted by the labels +1 and -1. The input feature vectors (X) and the matching class labels (Y) comprise our training dataset.
The equation of the separating hyperplane can be written as:
\(w^Tx+b=0\)
The vector \(w\) is the normal vector to the hyperplane, i.e., the direction perpendicular to it. The parameter \(b\) represents the offset of the hyperplane from the origin along the normal vector \(w\).
The signed distance of a data point \(x_i\) from the hyperplane is

\(d_i = \frac{w^T x_i + b}{\|w\|}\)

where \(\|w\|\) is the Euclidean norm of the normal vector \(w\).
A new point \(x\) is classified according to the side of the hyperplane on which it falls:

\(\hat{y} = \begin{cases} +1 & \text{if } w^T x + b \geq 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases}\)
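As a quick numeric illustration, the sketch below (with invented values for \(w\), \(b\), and \(x\), not fitted to any data) evaluates this decision rule together with the signed distance defined above:

```python
import numpy as np

# Toy hyperplane parameters; illustrative values, not fitted
w = np.array([2.0, -1.0])
b = -0.5

def predict(x):
    """Classify by the side of the hyperplane w^T x + b = 0."""
    return 1 if np.dot(w, x) + b >= 0 else -1

def signed_distance(x):
    """Signed distance d_i = (w^T x + b) / ||w||."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

x = np.array([1.0, 1.0])
print(predict(x), signed_distance(x))  # 1  ~0.224
```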
Kernel Functions in SVM
In Support Vector Machines (SVM), the kernel function plays a crucial role in transforming the input feature space into a higher-dimensional space where the data can be linearly separated. This is particularly useful in cases where the data is not linearly separable in its original space. The kernel function computes the dot product between the feature vectors in this higher-dimensional space without explicitly mapping the vectors into that space, which is known as the “kernel trick.”
Common types of kernel functions include:
Linear Kernel: \(K(x_i, x_j) = x_i^T x_j\). This is the simplest form of the kernel, used when the data is linearly separable.
Polynomial Kernel: \(K(x_i, x_j) = (1 + x_i^T x_j)^d\). This kernel maps the input features into a polynomial feature space, allowing for polynomial decision boundaries.
Radial Basis Function (RBF) Kernel: \(K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)\). Also known as the Gaussian kernel, it maps the features into an infinite-dimensional space, providing a lot of flexibility for non-linear decision boundaries.
Each kernel function has its own set of parameters that need to be tuned for optimal performance. The choice of kernel function and its parameters can significantly impact the SVM model’s ability to capture the underlying patterns in the data.
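As a minimal sketch, the three kernels can be written directly in NumPy; the input vectors and the values of \(d\) and \(\gamma\) below are arbitrary illustrations rather than tuned parameters:

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, d=3):
    return (1 + np.dot(xi, xj)) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj))      # -1.5
print(polynomial_kernel(xi, xj))  # -0.125
print(rbf_kernel(xi, xj))         # ~0.0098
```

Each function returns the inner product the data would have in the corresponding feature space, without ever constructing that space explicitly.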
The margin in SVM is defined as the distance between the separating hyperplane and the nearest data points from each class, known as the support vectors. The goal of SVM is to find the hyperplane that maximizes this margin, as a larger margin is associated with better generalization ability of the model.
Support vectors are the data points that lie closest to the decision boundary and are critical in defining the position and orientation of the hyperplane. These are the points that directly influence the shape of the decision boundary, as any small change in their position can alter the hyperplane. The SVM model is said to be “sparse” because only the support vectors contribute to defining the hyperplane, while other data points have no influence.
The objective function that SVM optimizes is a combination of maximizing the margin and minimizing the classification error. This is achieved through the minimization of the following objective function:
\(\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i\)
Subject to the constraints:
\(y_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i\)
where \(w\) is the weight vector, \(b\) is the bias term, \(C\) is the regularization parameter, \(\xi_i\) are the slack variables representing the degree of misclassification of the \(i\)-th data point, and \(y_i\) are the class labels.
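To make the objective and constraints concrete, the following small sketch (with toy \(w\), \(b\), data, and \(C\), all invented for illustration) computes the slack variables and the resulting objective value:

```python
import numpy as np

# Toy parameters and data; illustrative values only
w = np.array([1.0, -1.0]); b = 0.0; C = 1.0
X = np.array([[2.0, 0.0], [0.5, 0.0], [0.0, 2.0]])
y = np.array([1, 1, -1])

margins = y * (X @ w + b)         # y_i (w^T x_i + b)
xi = np.maximum(0, 1 - margins)   # slack: margin violation of each point
objective = 0.5 * np.dot(w, w) + C * xi.sum()
print(xi, objective)              # [0.  0.5 0. ]  1.5
```

Here only the second point falls inside the margin, so it alone contributes slack to the objective.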
The hinge loss function is used in SVM to penalize misclassifications. It is defined as:
Hinge loss = \(\max(0, 1 - y_i (w^T x_i + b))\)
The hinge loss is zero for correctly classified points that are outside the margin, and it increases linearly for points that are on the wrong side of the hyperplane or within the margin.
The optimization of the objective function involves finding the values of \(w\) and \(b\) that minimize the function, subject to the constraints. This is typically done using quadratic programming techniques.
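SVM libraries solve this quadratic program with specialized solvers (LIBSVM's SMO algorithm, for instance). As a conceptual sketch only, the same primal objective can also be minimized by plain subgradient descent on the hinge loss; the synthetic data, learning rate, and iteration count below are arbitrary choices for illustration:

```python
import numpy as np

# Synthetic, linearly separable 2-D data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)) + np.array([[2, 2]] * 20 + [[-2, -2]] * 20)
y = np.array([1] * 20 + [-1] * 20)

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = y * (X @ w + b)
    mask = margins < 1                                  # points with nonzero hinge loss
    grad_w = w - C * (y[mask][:, None] * X[mask]).sum(axis=0)
    grad_b = -C * y[mask].sum()
    w -= lr * grad_w
    b -= lr * grad_b
print(w, b)   # approximate separating hyperplane
```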
In the initial stage of our analysis, we undertook several preprocessing steps to ensure the data was suitable for modeling. Since our dataset did not contain any null values, we focused on encoding categorical variables and scaling numerical features. The categorical variables, such as ‘Country’ and ‘Gender,’ were encoded using one-hot encoding to convert them into a format that could be easily used by our machine learning algorithms. For numerical features like ‘Credit Score,’ ‘Age,’ ‘Tenure,’ ‘Balance,’ and ‘Estimated Salary,’ we applied standard scaling to normalize their distribution, ensuring that no single feature would dominate the model due to its scale.
Our Exploratory Data Analysis (EDA) aimed to uncover patterns, detect anomalies, and test hypotheses about our data. We started with summary statistics to understand the central tendency, dispersion, and shape of the dataset’s distributions. For instance, we observed that the Credit Score ranged from 350 to 850, with a median of 659, and the Age of customers varied from 18 to 92 years, with a median age of 37 years.
We then proceeded to visualize the distribution of key variables using distribution plots. This helped us identify the skewness in the ‘Age’ distribution and the uniform distribution of ‘Estimated Salary.’ Pair plots were employed to explore the relationships between variables like ‘Age’ vs. ‘Estimated Salary’ and ‘Age’ vs. ‘Credit Score,’ providing insights into how different factors might influence customer churn.
Through our EDA, we also investigated the distribution of the target variable ‘churn’ across different geographical regions and examined how the number of products varied across different regions. Correlation plots were utilized to identify potential relationships between features, revealing a positive correlation between ‘Age’ and ‘Balance,’ and a negative correlation between ‘NumOfProducts’ and ‘Balance.’
In our feature engineering process, we transformed the ‘Gender’ column from categorical to numerical by encoding ‘Male’ as 1 and ‘Female’ as 0. We also applied one-hot encoding to the ‘Geography’ column to convert it into binary variables for each country, ensuring that our model could interpret these categorical features correctly. Additionally, we split our data into training and testing sets to evaluate the performance of our models on unseen data. To address class imbalance in our target variable, we employed the Synthetic Minority Over-sampling Technique (SMOTE), which helped create a more balanced distribution of classes. Finally, we scaled our data using the StandardScaler to ensure that all features contributed equally to the model’s performance, preventing any feature with larger values from dominating the model’s learning process.
Loading Libraries
```r
library(tidyverse)
library(dplyr)
library(ggplot2)
# install.packages("corrplot")
library(corrplot)
library(caret)
library(smotefamily)
library(ROSE)
```

Load Data
```r
df <- read.csv("dataset/train.csv")
head(df)
```

```
  id CustomerId        Surname CreditScore Geography Gender Age Tenure  Balance
1  0   15674932 Okwudilichukwu         668    France   Male  33      3      0.0
2  1   15749177  Okwudiliolisa         627    France   Male  33      1      0.0
3  2   15694510          Hsueh         678    France   Male  40     10      0.0
4  3   15741417            Kao         581    France   Male  34      2 148882.5
5  4   15766172      Chiemenam         716     Spain   Male  33      5      0.0
6  5   15771669       Genovese         588   Germany   Male  36      4 131778.6
  NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
1             2         1              0       181449.97      0
2             2         1              1        49503.50      0
3             2         1              0       184866.69      0
4             1         1              1        84560.88      0
5             2         1              1        15068.83      0
6             1         1              0       136024.31      1
```
Summary Statistics
```r
summary(select(df, CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary))
```

```
  CreditScore         Age            Tenure         Balance
 Min.   :350.0   Min.   :18.00   Min.   : 0.00   Min.   :     0
 1st Qu.:597.0   1st Qu.:32.00   1st Qu.: 3.00   1st Qu.:     0
 Median :659.0   Median :37.00   Median : 5.00   Median :     0
 Mean   :656.5   Mean   :38.13   Mean   : 5.02   Mean   : 55478
 3rd Qu.:710.0   3rd Qu.:42.00   3rd Qu.: 7.00   3rd Qu.:119940
 Max.   :850.0   Max.   :92.00   Max.   :10.00   Max.   :250898
 NumOfProducts   EstimatedSalary
 Min.   :1.000   Min.   :    11.58
 1st Qu.:1.000   1st Qu.: 74637.57
 Median :2.000   Median :117948.00
 Mean   :1.554   Mean   :112574.82
 3rd Qu.:2.000   3rd Qu.:155152.47
 Max.   :4.000   Max.   :199992.48
```
Credit Score:
The Credit Score ranges from a minimum of 350 to a maximum of 850.
The median Credit Score is 659, indicating that half of the customers have a score below 659 and half have a score above.
The mean Credit Score is approximately 656.5, suggesting that the average creditworthiness of customers is in the mid-range.
The 1st quartile (25th percentile) is 597, and the 3rd quartile (75th percentile) is 710, indicating that 50% of customers have a Credit Score between 597 and 710.
Age:
The Age of customers ranges from 18 to 92 years. The median age is 37 years, meaning half of the customers are younger than 37 and half are older.
The mean age is approximately 38.13 years, indicating that the average customer is in their late thirties.
The distribution of Age is slightly right-skewed, as the mean is slightly higher than the median.
Tenure:
Tenure, or the number of years customers have been with the bank, ranges from 0 to 10 years.
The median tenure is 5 years, indicating that half of the customers have been with the bank for less than 5 years and half for more.
The mean tenure is approximately 5.02 years, suggesting that the average customer has been with the bank for around 5 years.
Balance:
The account Balance ranges from a minimum of 0 to a maximum of 250,898.
The median balance is 0, indicating that at least half of the customers have no balance in their account.
The mean balance is approximately 55,478, suggesting that while many customers have low or zero balances, some have significant amounts in their accounts.
Number of Products:
The Number of Products customers have with the bank ranges from 1 to 4.
The median number of products is 2, meaning that half of the customers have 2 or fewer products with the bank.
The mean number of products is approximately 1.554, indicating that on average, customers have between 1 and 2 products with the bank.
Estimated Salary:
The Estimated Salary ranges from a minimum of 11.58 to a maximum of 199,992.48.
The median estimated salary is 117,948, suggesting that half of the customers have an estimated salary below this amount and half above.
The mean estimated salary is approximately 112,574.82, indicating that the average estimated salary of customers is around 112k.
Count of Unique Values in Categorical Columns
```r
sapply(df[, c('Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited')],
       function(x) length(unique(x)))
```

```
     Geography         Gender      HasCrCard IsActiveMember         Exited
             3              2              2              2              2
```
Checking null values
```r
colSums(is.na(df))
```

```
             id      CustomerId         Surname     CreditScore       Geography
              0               0               0               0               0
         Gender             Age          Tenure         Balance   NumOfProducts
              0               0               0               0               0
      HasCrCard  IsActiveMember EstimatedSalary          Exited
              0               0               0               0
```
There are no null values in the data.
Distribution of target variable
```r
table(df$Exited)
```

```
     0      1
130113  34921
```
The number of customers who did not exit (130,113) is far greater than the number who exited (34,921), so there is considerable class imbalance in the data, which needs to be addressed while building the model.
Distribution of target variable across Geography.
```r
table(df$Geography, df$Exited)
```

```
              0     1
  France  78643 15572
  Germany 21492 13114
  Spain   29978  6235
```
France: 15,572 of 94,215 customers churned, the largest customer base with the lowest churn proportion.
Germany: 13,114 of 34,606 customers churned, a strikingly high churn proportion relative to the other countries.
Spain: 6,235 of 36,213 customers churned, a churn proportion similar to France's.
Which gender has the highest average credit score?
```r
aggregate(df$CreditScore, by = list(df$Gender), FUN = mean)
```

```
  Group.1        x
1  Female 656.2437
2    Male 656.6169
```
Observations:
The difference in average credit scores between male and female customers is minimal, indicating that gender does not significantly impact creditworthiness in this dataset.
Both genders have an average credit score in the mid-650s, which is considered a fair credit score range.
Distribution of Age.
```r
ggplot(df, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black")
```

Observations:
The largest concentration of customers falls within the 30 to 40-year-old range, indicating that the majority of customers are in their early to mid-career stages.
There is a significant drop in frequency as age increases, especially beyond 50 years. This suggests that the customer base skews younger.
The distribution is right-skewed, meaning there are fewer older customers (those over 60) compared to younger customers.
There is a small number of customers in the youngest age bracket (under 25 years) and the oldest (over 75 years).
Distribution of Estimated Salary:
```r
# binwidth = 5 is very fine for salaries spanning 0-200,000,
# which produces the spiky histogram noted below
ggplot(df, aes(x = EstimatedSalary)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black")
```

Observations:
The distribution is quite uniform across different salary ranges, with no distinct peaks that would indicate a concentration of individuals around a specific salary bracket.
There are frequent spikes throughout the distribution, which may suggest that the data contains many unique values with small frequencies. This could be indicative of precise salary estimations rather than rounded figures.
The salaries range from very low values close to 0 up to 200,000, indicating a diverse group from potentially different economic backgrounds or job roles.
There is no obvious concentration of data points around the lower, middle, or upper salary range, which is unusual for income data where one typically expects to see more of a bell-shaped distribution centered around a median salary range.
Comparing the distribution of account balances between customers who have exited and customers who have not.
```r
ggplot(df, aes(x = as.factor(Exited), y = Balance)) +
  geom_boxplot()
```

Observations:
Balance Distribution:
The y-axis represents the balance on customer accounts, which seems to range from 0 to a bit over 250,000.
Both boxes have a similar interquartile range (IQR), which is the range between the first quartile (25th percentile) and the third quartile (75th percentile), represented by the height of the boxes. This suggests that the middle 50% of balances are similarly distributed between both groups.
The median, indicated by the line within each box, is roughly at the same level for both groups, suggesting that the central tendency of balance is similar regardless of whether the customer has exited or not.
How does the distribution of the number of products vary across different geographical regions?
```r
ggplot(df, aes(x = Geography, fill = as.factor(NumOfProducts))) +
  geom_bar(position = "dodge")
```
Scatter plot of Age vs. Estimated Salary, colored by churn status, to check which age groups and salary ranges have exited the bank.
```r
ggplot(df, aes(x = Age, y = EstimatedSalary, color = as.factor(Exited))) +
  geom_point()
```

Observations:
There doesn’t appear to be a clear pattern or correlation between Age and Estimated Salary with customer churn, as the exited and non-exited customers are interspersed throughout the plot without any distinct clustering.
Customers who have exited are spread across all ages and salary levels, but there seems to be a slightly higher concentration of churned customers in the 40 to 50 age range.
Scatter plot of Age vs. Credit Score, colored by churn status, to check which age groups and credit score ranges have exited the bank.
```r
ggplot(df, aes(x = Age, y = CreditScore, color = as.factor(Exited))) +
  geom_point()
```

Observations:
There is a wide distribution of Credit Scores across different ages with no clear pattern indicating that Credit Score by itself may not be a strong predictor of customer exit.
Both exited and non-exited customers are found across the entire range of Credit Scores and Age, but there is a noticeable density of exited customers (blue dots) in the middle age range, particularly between ages 40 and 50.
Scatter plot of Estimated Salary vs. Credit Score, colored by churn status, to check which salary and credit score ranges have exited the bank.
```r
ggplot(df, aes(x = EstimatedSalary, y = CreditScore, color = as.factor(Exited))) +
  geom_point()
```

Observations:
The scatter plot shows no clear correlation between Credit Score and Estimated Salary in predicting customer churn, with both customers who exited and those who did not evenly dispersed across all ranges of Salary and Credit Scores.
Correlation Plot
```r
corr_matrix <- cor(select(df, CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary))
corrplot(corr_matrix, method = "circle")
```

Observations:
There seems to be a noticeable positive correlation between Age and Balance, and a negative correlation between NumOfProducts and Balance.
Churn Rate by Geography
```r
churn_by_country <- df %>%
  group_by(Geography) %>%
  summarise(
    Total_Customers = n(),
    Churned_Customers = sum(Exited),
    Churn_Rate = (sum(Exited) / n()) * 100
  )
print(churn_by_country)
```

```
# A tibble: 3 × 4
  Geography Total_Customers Churned_Customers Churn_Rate
  <chr>               <int>             <int>      <dbl>
1 France              94215             15572       16.5
2 Germany             34606             13114       37.9
3 Spain               36213              6235       17.2
```
```r
ggplot(churn_by_country, aes(x = Geography, y = Churn_Rate, fill = Geography)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Churn_Rate, 2)), vjust = -0.3) +
  labs(title = "Churn Rate by Country",
       x = "Geography",
       y = "Churn Rate (%)") +
  theme_minimal() +
  theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5))
```

Dropping the identifier columns, which carry no predictive signal:

```r
df <- df %>% select(-id, -CustomerId, -Surname)
names(df)
```

```
 [1] "CreditScore"     "Geography"       "Gender"          "Age"
 [5] "Tenure"          "Balance"         "NumOfProducts"   "HasCrCard"
 [9] "IsActiveMember"  "EstimatedSalary" "Exited"
```

```r
table(df$Gender)
```

```
Female   Male
 71884  93150
```
Model Building in Python

The modeling stage was carried out in Python with scikit-learn and imbalanced-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv('dataset/train.csv')
df.head()
```

```
   id  CustomerId         Surname  ...  IsActiveMember  EstimatedSalary  Exited
0   0    15674932  Okwudilichukwu  ...             0.0        181449.97       0
1   1    15749177   Okwudiliolisa  ...             1.0         49503.50       0
2   2    15694510           Hsueh  ...             0.0        184866.69       0
3   3    15741417             Kao  ...             1.0         84560.88       0
4   4    15766172       Chiemenam  ...             1.0         15068.83       0

[5 rows x 14 columns]
```
```python
df.drop(['id', 'CustomerId', 'Surname'], axis=1, inplace=True)
df['Gender'] = df['Gender'].replace({'Male': 1, 'Female': 0})
df.head()
```

```
   CreditScore Geography  Gender  ...  IsActiveMember  EstimatedSalary  Exited
0          668    France       1  ...             0.0        181449.97       0
1          627    France       1  ...             1.0         49503.50       0
2          678    France       1  ...             0.0        184866.69       0
3          581    France       1  ...             1.0         84560.88       0
4          716     Spain       1  ...             1.0         15068.83       0

[5 rows x 11 columns]
```
```python
# One-hot encode Geography and separate features from the target
df = pd.get_dummies(df, columns=['Geography'])
X = df.drop('Exited', axis=1)
y = df['Exited']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Oversample the minority class in the training set only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Standardize features; fit the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```

```
Accuracy: 0.8136979661085415
```
```python
class_report = classification_report(y_test, y_pred)
print('Classification Report:\n', class_report)
```

```
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.87      0.88     39133
           1       0.55      0.62      0.58     10378

    accuracy                           0.81     49511
   macro avg       0.72      0.74      0.73     49511
weighted avg       0.82      0.81      0.82     49511
```
The macro-average F1 score is 0.73, which is not great, so we perform hyperparameter tuning to improve the model's performance.
Hyperparameter Tuning using GridSearchCV
```python
# The grid search below was run once to find good hyperparameters,
# then commented out to avoid repeating the expensive search.
# param_grid = {
#     'C': [0.1, 1, 10, 100],              # Regularization parameter
#     'kernel': ['linear', 'rbf', 'poly'], # Kernel type
#     'gamma': ['scale', 'auto'],          # Kernel coefficient for 'rbf', 'poly' and 'sigmoid'
# }
# svm = SVC()
# grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
# grid_search.fit(X_train, y_train)
# print('Best Parameters:', grid_search.best_params_)
# print('Best Score:', grid_search.best_score_)

# Refit with the tuned parameters
svm_model = SVC(kernel='poly', C=0.1, degree=3, gamma='scale')
svm_model.fit(X_train, y_train)
```
```python
# best_model = grid_search.best_estimator_
y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```

```
Accuracy: 0.8427016218618085
```
```python
class_report = classification_report(y_test, y_pred)
print('Classification Report:\n', class_report)
```

```
Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.89      0.90     39133
           1       0.61      0.68      0.65     10378

    accuracy                           0.84     49511
   macro avg       0.76      0.78      0.77     49511
weighted avg       0.85      0.84      0.85     49511
```
References

Jakkula, V. (2006). Tutorial on support vector machine (SVM). School of EECS, Washington State University, 37(2.5), 3.
Kecman, V. (2005). Support vector machines: An introduction. In Support vector machines: Theory and applications (pp. 1-47). Berlin, Heidelberg: Springer.
Yue, S., Li, P., & Hao, P. (2003). SVM classification: Its contents and challenges. Applied Mathematics: A Journal of Chinese Universities, 18, 332-342.
Jun, Z. (2021). The development and application of support vector machine. Journal of Physics: Conference Series, 1748(5), 052006. IOP Publishing.
Bhavsar, H., & Panchal, M. H. (2012). A review on support vector machine for data classification. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 1(10), 185-189.
Deris, A. M., Zain, A. M., & Sallehuddin, R. (2011). Overview of support vector machine in modeling machining performances. Procedia Engineering, 24, 308-312.
Han, S. (2007). Using SVM with financial statement analysis for prediction of stocks. Communications of the IIMA, 7. https://scholarworks.lib.csusb.edu/cgi/viewcontent.cgi?article=1059&context=ciima
Ahmadi, M. I., et al. (2020). Sentiment analysis of an online shop on the Play Store using the support vector machine (SVM) method. Seminar Nasional Informatika (SEMNASIF), 1(1), 196-203. https://jurnal.upnyk.ac.id/index.php/semnasif/article/view/4101
Razzaghi, T., et al. (2016). Multilevel weighted support vector machine for classification on healthcare data with missing values. PLOS ONE, 11(5), e0155119. https://doi.org/10.1371/journal.pone.0155119
Öz, E., & Kaya, H. (2013). Support vector machines for quality control of DNA sequencing. Journal of Inequalities and Applications, 2013(1). https://doi.org/10.1186/1029-242x-2013-85
Support vector machine for network intrusion and cyber-attack detection. IEEE Conference Publication. https://ieeexplore.ieee.org/abstract/document/8233268
Kumar, S., et al. (2017). Precision sugarcane monitoring using SVM classifier. Procedia Computer Science, 122, 881-887. https://doi.org/10.1016/j.procs.2017.11.450
Javeed, A., et al. (2023). Early prediction of dementia using feature extraction battery (FEB) and optimized support vector machine (SVM) for classification. MDPI. https://www.mdpi.com/2227-9059/11/2/439
Nawal, Y., Oussalah, M., Fergani, B., & Fleury, A. (2022). New incremental SVM algorithms for human activity recognition in smart homes. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-022-03798-w
Zhang, L., Hu, H., & Zhang, D. (2015). A credit risk assessment model based on SVM for small and medium enterprises in supply chain finance. Financial Innovation, 1(1). https://doi.org/10.1186/s40854-015-0014-5
Harimoorthy, K., & Thangavelu, M. (2021). Multi-disease prediction model using improved SVM-radial bias technique in healthcare monitoring system (retracted article). Journal of Ambient Intelligence and Humanized Computing, 12, 3715-3723. https://doi.org/10.1007/s12652-019-01652-0
Liang, J., Qin, Z., Xue, L., Lin, X., & Shen, X. (2021). Verifiable and secure SVM classification for cloud-based health monitoring services. IEEE Internet of Things Journal, 8(23), 17029-17042. https://doi.org/10.1109/JIOT.2021.3075540
Ahmad, G. N., Fatima, H., Ullah, S., Salah Saidi, A., & Imdadullah. (2022). Efficient medical diagnosis of human heart diseases using machine learning techniques with and without GridSearchCV. IEEE Access, 10, 80151-80173. https://doi.org/10.1109/ACCESS.2022.3165792
Sahara, S., Purnamawati, A., Sukmana, S. H., Mailasari, M., Sikumbang, E. D., & Puji, E. (2023). PSO optimization for analysis of online marketplace products on the SVM method. AIP Conference Proceedings. https://doi.org/10.1063/5.0129404
Prediction of consumer purchasing in a grocery store using machine learning techniques. IEEE Conference Publication. https://ieeexplore.ieee.org/document/7941935
Barakat, N., et al. (2010). Intelligible support vector machines for diagnosis of diabetes mellitus. IEEE Transactions on Information Technology in Biomedicine, 14(4), 1114-1120. https://doi.org/10.1109/titb.2009.2039485
Applying support vector machine to electronic health records for cancer classification. IEEE Conference Publication. https://ieeexplore.ieee.org/abstract/document/8732906
Gu, J., et al. (2021). An effective intrusion detection approach using SVM with naïve Bayes feature embedding. Computers & Security, 103, 102158. https://doi.org/10.1016/j.cose.2020.102158
Hosseini, S., & Hasani Zade, B. M. (2020). New hybrid method for attack detection using combination of evolutionary algorithms, SVM, and ANN. Computer Networks, 173, 107168. https://doi.org/10.1016/j.comnet.2020.107168